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This report discusses the implementation of two parallel algorithms on a dis- 
O ' tributed memory system for studying vortex dynamics in type-II superconductors. 

»' ! , These algorithms have been previously implemented for classical molecular dynamics 

|' simulation with short-range forces [1]. The run time for parallel algorithm is tested 

3 on a system containing upto 4 processors and compared with that for vectorized 

algorithm on a single processor for system size ranging from 120 to 4800 vortices. 

td: 

S I. INTRODUCTION 

The classical molecular dynamic (MD) simulation is a widely used computational method for 
studying the collective behaviour of interacting particles and appears in many diverse areas of 
I . 1 science. The MD simulation involves numerically solving the classical Newton's equation of motion 
for each particle starting from a set of initial conditions. The force on each particle in derived from 
the potential energy functional which contains the physics of the problem. Since the equation of 
motion of individual particle is solved, the MD simulation allows one to obtain details of the static 
i and dynamical properties at the smallest length scale in the problem. 

' The computational complexity of the MD lies in two factors : the number of particles N, and the 
(f-) , time scale of the phenomenon under simulation. A typical example is the melting of a simple cubic 
f^) • solid with dimension 50 lattice constants which requires 125,000 atoms simulation. Also, simulation 
t-H \ of few pico-seconds of the real time dynamics would correspond to thousands of time steps which can 
be prohibitive even for large computers. Considerable research has therefore gone into optimizing 
algorithms for MD simulation depending upon the problem in hand. With development of parallel 
machines, various scalable algorithms for MD simulation has been given which can be used on a 
system with few processors to hundreds of processors. These algorithms use the message-passing 
model of programming for inter-processor communication. Another advantage of writing simulation 
code using message passing is due to availability of standard library which allows portability to 
Q \ various computing platforms. 

In this report, we describe the implementation of two parallel MD algorithms for simulating 
2D vortex dynamics in type-II superconductors. The algorithm used here have been previously 
developed and implemented by others for simulating short range MD simulation [1,2] (see also Ref. 
[3] for a review). To a large extent, this report is based on the work by Plimpton on short-range MD 
[1]. The first algorithm uses the particle decomposition method wherein each processor is assigned 
a fixed number of particles. Each processor calculates the force on the same particles assigned to it 
irrespective of its spatial position. The second algorithm uses force decomposition method in which 
each processor calculates the sub-block of the force matrix. The parallelization is implemented using 
MPI and tested on a system with 4 processors. The purpose of this report is to present the simplicity 
of the algorithms in implementing and gain in run time compared to sequential algorithms on a single 
processor. 

The report is presented as follows : in section II, the computational aspect of the problem is 
described. Section III explains the two algorithms in detail, whereas section IV compares the runtime 
for the two algorithms with vectorized algorithm. 
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II. BASIC EQUATION FOR VORTEX DYNAMICS AND ITS COMPUTATIONAL 

ASPECT 



A vortex in the mixed phase of a bulk type-II superconductor is a line-like object which encom- 
passes a quantum of magnetic flux (|| where h is Planck constant, c is the velocity of light, and 
e charge of an electron) threaded by circulating super-current [4]. The interaction between two 
vortices is electromagnetic and hence of long range nature. Though the vortex origin requires a 
quantum mechanical explanation, most of the physical properties (magnetisation, transport) of the 
mixed phase can be described by considering the vortices as classical interacting objects. In 2D, 
these vortices can be treated as point massless particles obeying the following overdamped equation 
of motion 

^5 = E FW (*» ? i) + E FVp (n,Rj) + ? ext (i) 

j& 3 

where fl represents the co-ordinate of the z-th particle, and the first summation runs over all N — 1 
particles. This term represents the repulsive force between two vortices and is given by the first- 
order Bessel function K\{r). At distances r >• 1, the function decays rapidly as exp(— r). The 
second term represents the interaction between the vortex and the quenched disorder. The disorder 
co-ordinates R is distributed randomly in the simulation region, and the potential is attractive and 
short ranged. The last term is the force due to external condition that can include transport current. 
The above equation is a paradigm for driven interacting particles in presence of quenched disorder, 
and appears in other fields such charge-density wave (CDW), colloids, tribology. 

Computationally, the maximum processor time in a single MD time step goes in calculating the 
two-body force represented by the first term. Even with overhead computational expenses, this 
could be as high as 95% (for instance, with 4800 vortices and 2000 point disorder, the first term 
accounts for 98% of the processor time). Therefore, much of the effort has been directed to optimize 
algorithms for calculating the two-body force. Since the second term is quenched and short ranged, 
it can be calculated efficiently by binning, and summing the forces over the closest bins. The third 
term is a one-body term and hence be efficiently vectorized and takes small fraction of the total 
computation time. Rest of the report hence deals with the computational aspect of the first term 
only. 

It is useful to consider the positions of all particles as an ^-dimensional vector r = 

(fi,r2,f^j Hv) where each of the Fj holds the cZ-dimensional co-ordinate of the i-th particle 

(henceforth, particle is assumed to imply vortex). Thus, for the 2D case r*j = (xi,yi). The force 
Fij = F(fi,fj) between two particles at r, and fj can be thought as an clement of a matrix of 
size N x N. We denote such a force matrix by F. The total force on all N particles can then be 
represented as a column vector of length N such that the i-th element is given by fa = F^. 
Symbolically, such a vector f can be written as 

f=r®Fr (2) 

The MD simulation thus amounts to calculating f at each time step which is then used to update 
the vector r. 

The computational expense is reflected in the number of elements needed in the summation for 
fi. For long-range forces such as Coulomb force and gravitational force, each particle interacts with 
all (N — 1) particles, thus requiring N — 1 elements of F^ in the summation for /j. Therefore, a 
direct calculation of such forces scales as iV 2 , and becomes prohibitively expensive even for moderate 
values of N. In the last decade, various approximate methods have been developed which allows 
considerable run time gain. They include particle-mesh algorithms which scale as f(M)N where M 
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is the number of mesh points, hierarchical methods which goes as O(NlogN) [5], and fast multipolc 
methods which scale as N [6] . But most of these algorithms are difficult to implement and requires 
particular attention to data structure. 

On the other hand, direct methods are specifically suitable for short-range forces in which case 
each particle interacts with few particles within a distance r c . Typical examples are MD simulation 
employing Lennard- Jones potential and van-der Waals potential. In these simulations, a significant 
computational time is spend in searching for the particles within the cut-off distance r c . There 
are two techniques usually employed for doing this. The first method constructs list of all atoms 
within a distance r c + S where 6 is chosen to be small compared to r c . The list is updated after 
few timcstcps such that the total distance traveled by a particle between successive updates is less 
than S. The two-body force on a particle is found by summing over all particles within the list. The 
second method, known as link-cell method, employs binning of particles at every time step into cells 
of dimension r c . Since this requires O(N) operations, the overhead expenses more than compensates 
for the time required for searching particles within r c through the list of all N particles. The fastest 
MD algorithms use both these techniques together. 

For the interaction under consideration, the inter-particle force decays to a value w 10 -8 at a 
distance r c w 15 (in reduced units). This length scale is considerably larger (sometimes nearly half 
the system size) than usually employed in short-range MD, yet does not cover the whole system. 
Also, since the particle distribution is uniform, this leads to a significant fraction of total number 
of particles to be within the distance r c (for example, for 1200 particles in a square box of length 
36, there are on an average 200 particles within a distance r c ). The construction of neighbour list is 
therefore of not much advantage and can occupy large memory. Also in the situation where particles 
are driven, the position changes randomly which would require frequent updating of the list. If 
r c is as large as half the length of the simulation box, binning of particles is not expected to give 
significant time gain (on the other hand, this is used for calculating the force due to disorder which is 
short ranged). The MD simulation presented here therefore does not employ both these techniques. 

III. PARALLEL ALGORITHMS 

In the last decade, with the arrival of multi-processor machines, much effort has been spend in 
constructing parallel MD algorithms. It is said that the MD computations are inherently parallel 
in the sense that force calculation and updates can be done simultaneously on all particles. The 
aim of parallelizing MD is to distribute the task of force calculation evenly among all processors. 
The various parallel algorithms are based on two basic methods. In the first class of methods, a 
pre-determined set of computation is assigned to each processor that remains fixed throughout the 
simulation time. One way of doing is to assign a group of particles to each processor, and at any 
time the processor computes the force for the particles assigned to it no matter where the particles 
are located in the simulation region. This is usually referred as atom decomposition (AD) or 
replicated-data method. Another possible way is to subscribe sub-blocks of the force matrix 
computation to each processor that has led to force-decomposition (FD) algorithms. Both these 
methods are analogous to Lagrangian gridding in computational fluid dynamics (CFD) where the 
grid cells (computational elements or processors) move along with the fluid (particles or atoms in 
MD). These two algorithms are discussed in detail below. 

The second class of methods uses spatial decomposition (SD) wherein the simulation region is 
decomposed into cells and each cell is assigned to a processor. Each processor then computes the 
force on the particles within the cell region assigned to it. This is analogous to Eulerian gridding in 
CFD where the grid remains fixed and the fluid moves through it. In this report, we deal only with 
the implementation of atom, decomposition and force-decomposition algorithms. Due to small number 
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of processors available for computation, the spatial decomposition algorithm cannot be effectively 
compared with the others. 

For the purpose of further discussions, we assume following symbols: the column vectors r, f and 
the matrix F assigned to a processor with rank k is designated by the subscript k. The rank for the 
processors are indexed from to P — 1 where P is the total number of processors. Thus, and 
ffe are the position vector and total force vector held in processor k, and Ffe is the sub-block of the 
force matrix assigned to it. The number of particles (and hence the length of the vector and ffe) 
assigned to the fc-th processor is represented by Nk- In all the cases considered here Nk = N/P. 
For simplicity, N is chosen to be an integer multiple of P, and is not a constraint on the algorithm. 

A. Atom-Decomposition algorithm 

As mentioned earlier, in this method the computation is carried out by each of the P processors 
for Nk(= N/P) particles which are assigned to it at the start of the simulation. This amounts 
to assigning a sub-block of Nk rows of the force matrix F to each processors. The fc-th processor 
computes matrix elements of the sub-block Ffe . Let us assume that each processor has the updated 
co-ordinates of all particles initially assigned to it. To calculate the force for the next time step, 
the particles in fc-th processor requires positions of all other particles held by the P — 1 processors. 
This implies that at each time step, each of the processor needs to communicate the co-ordinates of 
particles held by it to all other processors. This kind of collective communication can be implemented 
by calling all-to-all operation in MPI. An efficient algorithm to perform the same operation has been 
given by Fox et al. [7] and is called an expand operation (see Appendix A for the details) . 

There are two versions of atom-decomposition method that is discussed in ref. [1]. The Al 
algorithm is a straightforward implementation of the above mentioned procedure. It does not use 
the skew-symmetric property of the force matrix F, and hence the two-body force F t j is calculated 
twice (one by the processor holding fi in the sub- vector, and the other holding fj). The Al algorithm 
is outlined below : 

1. Expand vector and construct r in each processor. 

2. Compute the sub-block Ffe using and r, and sum the elements into ffe in each processor. 

3. Update the sub- vector rfe in each processor using ffe. 

The communication cost of an algorithm is gauged by the number of messages (in MPI, a pair 
of MPLSEND and MPLRECV) and the volume of data exchanged. It would be fruitful to see both 
these factors for the above algorithm in each step. The first step expands the rfe in each processor to 
construct the full vector r. This uses Nk vector length for communication to P — I processors, and 
therefore scales as NkP — N (actually NkP — Nk). The second step computes the force on all the 
Nk particles in rfe. This operation scales as N ■ Nk — ^p-. And the last operation updates the vector 
Yk in each processor which also scales as Nk- Thus overall, the communication cost increases as N 
and computation cost as ^p-. Symbolically, the steps in the above algorithm can be represented as 

r = E P [r k ] 

ft = r k <g> F • r (3) 

where Ep is an expand operation discussed in Appendix A. Note that ffe is of length Nk- 

The second algorithm, denoted by A2, uses Newton's law to avoid double computing of the 
two-body force. Though this leads to extra communication between processors, this can still 
outperform the algorithm Al even for moderate values of N when number of processors P is 
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small. The A2 algorithm references the pair-wise interaction only once by using a modified 
force matrix G [1] , which is defined as follows : Gij = Eij , except that Gij = when i + j 
is even for i > j, and also when i + j is odd for i < j. This makes the matrix G appear as 
a checkerboard (with the diagonal also set to zero) as shown in Fig.l. The difference between 
Al and A2 is in the second step. The force between i-th and j-th particle is calculated, and 
summed into both i and j positions of the resulting force vector f . For e.g., from Fig.l, at 
the end of the force calculation, the processor and 1 will have following elements in the vec- 
tor f Processor Processor 1 



Fi.3 + Fi^ + Fij + Fi.c) + Fiji — ^2,1 

F2.I + F2.4 + i<2,6 + i*2,8 + -F240 + ^2,12 " 
^3,2 + ^3,5 + F$j + F^.g + F311 — Fi^ 

— F2,4 

— Fl,5 — ^3,5 
— -^2,6 

— Fl,7 — F 3t 7 

— F2,8 

— Fi t g — -F^g 

— -^2,10 

— -Fl.ll — F3.11 

— F2.12 



—Faa — Ft 



F 



3,2 



F^i + F 4j3 + F4.Q + F 4j8 

^5,2 + F 5 ^ + F 5 j , ^ 0j y 

F e ,i + F 6 ,3 + F 6 . 5 + F 6i8 + F 6 ,io 



- ^4,10 
^5,9 



4,1 

~ F 4i3 
+ F4.I2 
■f ^5,11 

+ F 6 . 12 

— F 4:8 

-F 4 ,i - 
-F4.12 



-F, 



6.1 

5,2 



— F&.3 

— F5.4 

— F e . 5 
-F ifi 
-F 5 , 7 

— F e .s 
-F B ,g 

— Fqaq 
5,11 

12 



-F t 
-F e . 



This means that the vector obtained as an output from the force calculation in each processor is 
again of length TV unlike Al where it is of size N^. The vector f is summed across all processors 
so that total force is obtained as the sub- vector ffc in each processor. This operation of summing 
is called fold and can be performed optimally by Fox's algorithm [7] (see Appendix B). The A2 
algorithm can then be enumerated as follows : 

1. Expand vector r k and construct r in each processor. 

2. Compute the sub-block of G using and r, and obtain f . 

3. Fold vector f across all processors and obtain f^. 

4. Update the sub- vector r k in each processor using ffe. 

Note that there is an additional communication (in step 3) of order TV as compared to Al at the 
expense of halving the force calculation. A gain in runtime is possible if the communication time is 
a small percentage of overall time which is not the case if P is large. As pointed out by Plimpton, in 
that case Al algorithm is faster than A2. But for the case where the force calculation is intensive 
and P is small, A2 can outperform Al which is indeed observed in our simulation. Symbolically, 
the A2 algorithm becomes 

r = E P [v k ] 
f ' = r k <g> G • r 

ffc = Rp[f'] (4) 

where Rp is a fold operation discussed in Appendix B. 

Both Al and A2 algorithms has communication operations which scales as N. This for large 
values of N can be prohibitive. The advantage with these algorithms is its simplicity, which allows 
parallelizing an existing vectorized code (and this was indeed the reason why parallelization was 
attempted in our simulation). 
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B. Force-Decomposition algorithm 



The force-decomposition (FD) algorithm, which was first implemented by Hcndrickson and Plimp- 
ton for short-range forces, was motivated by block-decomposition of matrix for solving linear algebra 
problems. An important aspect of this algorithm is that the communication between processors 
scales as N/\/~P rather than N for the atom-decomposition algorithm. For the FD algorithm the 
processors are assumed to be arranged in cartesian topology (see ref. [8]). Also, we use the following 
notations : the cartesian co-ordinates of the processor is denoted by the subscripts (J, k) (j is a row 
index, and k the column index) and runs from to \/~P '— 1. Thus, tj^ and f,^ are the position and 
total force vectors held by the processor with co-ordinates (j, k) . We also use notation in which row 
(column) index in the subscript is absent which implies that the same vector is held by all processors 
in the given column (row) index. As an example, is held by all processors in the j-th row, and 
r by all processors in the fc-th column. This typically arises when a vector is expanded along a row 
or column of processors (see appendix A). Also, for simplicity we assume that P is an even power 
of 2, and N is an multiple of P. 

The block decomposition of the force matrix is performed not on F but on a new matrix F 
obtained by using vector r and permuted vector r . The permuted vector r is formed by lining up 

r jtk held by each processors column wise i.e., r = (r ,o, ri )0 , Vp-i,o> ""M' ""i. 1 ' — ■> Vp-i,1' )■ 

Note that the vector r is distributed as r = (r ,o,ro,i, r Q v /p_ 1 ,ri ) o,ri ) i, r 1 ^rp_ ll )■ Both 

these vectors are shown in Fig. 2. The elements of the force matrix ¥[■ is the force between particles 
at fi and r-. 

To calculate the elements of the sub-block F • k , two sub- vectors of length N/ \fP each from r and 
r are required. For a processor with cartesian co-ordinates (j, k), these vectors are in fact obtained 
by expand operation across j-th row and fc-th column. These vectors are represented as Tj t and r k . 
As an example, from Fig. 2, the vectors and r k are given below for all the 4 processors : 

Processor : r , = {1,2,3,4,5,6} and r = {1,2,3,7,8,9} 
Processor 1 : r ^ = {1,2,3,4,5,6} and r i = {4,5,6,10,11,12} 
Processor 2 : n] = {7,8,9,10,11,12} and r = {1,2,3,7,8,9} 
Processor 3 : n, = {7,8,9,10,11,12} and r i = {4,5,6,10,11,12} 

The force is then calculated between the elements of Vj, and r' fc , and summed in a sub- vector 

fj k which is of length N/y/P. The total force sub- vector f,^ is obtained by a fold operation on 

f • k across the j-th row of processors. Note that the expand and fold operations are along a row 

or column requiring only \fP processors, unlike AD algorithm where all P processors are involved. 
This leads to an overall communication cost which scales as N/y/P only unlike N in AD algorithms. 
The algorithm is summarised in the following steps : 

1. Expand vector rj.k along j-th row, and construct in each processor. 

2. Expand vector r^fc along fc-th column, and construct r' fc in each processor. 

3. Compute the sub-block of F^- k using and r k , and obtain i- k . 

4. Fold vector f • k across j-th row of processors and obtain fj^. 

5. Update the sub-vector rj^ in each processor using fj^. 
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Symbolically, the first four steps of the above algorithm can be represented as 

T j, = E P, k [*j\k] 

f 'j,k = r j, ® F ' • r !/c 

f jife = R Pj ,[f; fe ] (5) 

Note that there are 3 communication processes, 2 expand, and 1 fold operation involved in this 
algorithm. 

This algorithm does not make use of Newton's law and is designated as Fl algorithm. Using 
Newton's law leads to halving of the computation at the expense of increased inter-processor com- 
munication. Instead of using F for obtaining a permuted matrix, one can use the checkerboard 
matrix G described for A2 algorithm. Such a matrix is shown in Fig. 3 for P = 4 and N = 12. 
Using G modifies the Fl algorithm in step 3 and 4. In step 3, the elements of G^ k are computed 

and accumulated in i th position of f'j k and j th position of f'- k . For example, these vectors are given 
explicitly below for the two processors shown in Fig. 3 : 



Processor 



^0,0 


f(3,0 


^1,3 + Fl.7 + ^1,9 


^2,1 + F4,l + ^6,1 


^2,1 + -p2,8 


^3,2 + F 5 ,2 


-p3,2 + -p3,7 + ^3,9 


Fl,3 + -P4,3 + -P6,3 


^4,1 + -^4,3 + ^4,8 


-P\,7 + ^3,7 + ^5,7 


^5,2 + ^5,7 + ^5,9 


F2,8 + F4.8 + ^6,8 


^6,1 + -p6,3 + -^6,8 


^1,9 + -p3,9 + ^5,9 









Fl,5 + 


^1,11 




F 2 ,4 + 


-p2,6 " 


f- ^2,10 + ^2,12 


F 3 , 5 + 


-p3,ll 




F 4 ,6 + 


-Pi.lO 


+ ^4,12 


F 5A + 


-^5,11 




F 6 , 5 + 


-^6,10 


+ ^6,12 


The 


fj k is folded across 



Processor 1 



fo,i 



-p2,4 + ^5,4 

^1,5 + ^3,5 + -p6,5 

-p2,6 + ^4,6 

-p2,10 + ^4,10 + ^6,10 

-Pi, 11 + ^3,11 + ^5,11 

^2,12 + ^4,12 + -^6,12 



of processors to obtain f k . The total force fj^ is obtained by subtracting f • from f k element by 
element. In the above example, folding the vector f and f along the row gives for particle 1 

-Pi, 3 + -Pi, 7 + ^1,9 + -Pi, 5 + -Pi, 11 



'0,0 *0,1 



whereas same operation on f and i x along the column gives 

-P2,l + -P4,l + -Pfe,! + -Pkl + ^10,1 + -Pl2,l 
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The total force on particle 1 is obtained by subtracting the second sum from the first. This also 
brings out an important feature that all elements of the force matrix for an i-th particle is calculated 
by those processors which share the row and column with the processor to which i-th particle is 
assigned. The F2 algorithm can be enumerated as follows : 

1. Expand vector rj,k along j-th row, and construct rj, in each processor. 

2. Expand vector Tj t k along fc-th column, and construct r k in each processor. 

3. Compute the sub-block k and accumulate them in t'. k and i- k . 

4. Fold vector f . k across j-th row of processors and and obtain . 

5. Fold vector f'- k across fc-th column of processors and obtain f" k . 

6. Subtract f k from f • , and obtain fj t k 

7. Update the sub- vector rj t k in each processor using f^fc. 
Symbolically, the F2 algorithm has the following structure 





= Ep fc [v 3 .k] 




= E P 3 , [Tj,k] 




= V 3, ® G' • Y k 








= r p [ f j,fc] 







(6) 

The increased communication and computation is evident in step 5 and 6 at the expense of reduced 
force computation. 

IV. RESULTS 

We have implemented all the four algorithms Al, A2, Fl and F2 on a 4 processor (each of 
180MHz) SGI system. The first order differential equation (1) is solved using a fourth order predictor- 
corrector scheme. This requires evaluating the right hand side of Eq.(l) twice at each time step. 
All the calculation is done in double precision. Since the force between two particles is Bessel 
function of order 1 (.Ki(r)), evaluating the function at each time step is expensive. We follow the 
usual practice of tabulating the function at regular interval over the full range. The force at any 
distance is then obtained by interpolating the values in the table. The program is written in Fortran 
90 whereas the inter-processor communication is handled using MPI. For the AD algorithms, the 
default communicator world is used which does not specify any topology for the processors. For FD 
algorithms, the cartesian topology is imposed leading to a grid of 2 x 2 processors. The number of 
particles tested ranged from 120 to 4800. The idea here is not to simulate large particles but to show 
the possibility of parallel computation and its advantage even on a small number of processors. 

Fig. 4(a) shows the CPU time required for a single MD time step with increasing density of 
particles for different algorithms using all 4 processors. The time for Al algorithm and Fl is nearly 
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equal. Note that both these algorithms computes the inter-particle force twice. The number of 
communications for Al is less than that for Fl at the expense of increased force computation. 
On the other hand, for A2 and F2 algorithms, the number of communications are same, and A2 
outperforms F2 due to less computation required. Note that with increasing number of processors, 
F2 is expected to perform better [2] as communication cost for this algorithm goes as 0(N/\/~P) 
rather than O(N) for A2. 

Though, the total number of processors available is too small to obtain the trends with varying 
number of processors, it would still be instructive to plot the CPU time as a function of P for a 
system of 4800 particles and is shown in Fig. 4(b). The dotted line represents the ideal speed up, 
which is obtained by assuming that the time required for a single processor is equally divided among 
all available processors. 

V. CONCLUSIONS 

This report shows that considerable gain in run time can be achieved on implementing parallel 
MD simulation even for a moderate number of processors. Though the algorithms that have been 
implemented here were used by others for short-range MD, present work shows that it can be used 
even for forces that are inter- mediate range with advantage over vectorized algorithms. As stressed 
in this report, the main advantage is its simplicity in implementing and requiring few modifications 
for an already vectorized simulation code for a sequential machine. 

For this work, the emphasis is simply on the parallelization of the basic MD simulation. In 
the implementation reported here, a significant fraction of the computing time goes in searching for 
particles within the cut-off distance for the force. It is to be seen as to how the conventional methods 
of MD, which uses nearest neighbour tables and link-cell method would enhance the performance. 
As stressed before, the problem at hand is that of driven particles in presence of quenched random 
disorder. At depinning velocities when a fraction of the particles move, the neighbourhood of a 
particle changes rapidly with time. This would call for frequent updating of the nearest neighbour 
table. Also, maintaining such a table for long range force may affect the overall performance which 
need to be verified. Other approximate methods for long-range force [5,6] need to be explored though 
these algorithms are difficult to implement on distributed memory machines. 

APPENDIX A 

The expand operation on vectors x,t = {xi}^ 1 (the rank of the processor is denoted by sub- 
script k) held by P processors results in a vector y = {yi}f=i in all processors such that 

y = (xi,X2,X3, xp). The length of the vector y in all the P processors is N — NkP- Thus, 

vector y is constructed by arranging vector x in ascending order of the rank of the processors. 
Symbolically, this operation can be represented as 

Ep[x fc ]=y (7) 

where x = {xi}^ 1 and y = {yij^f . Thus after an expand operation, the output buffer in each 
processor has a vector of length which is P times the length of the vector held by each processor. 

For a 2D grid of processors, if (J, k) is the cartesian co-ordinates, an expand operation across j-th 
row can be symbolically represented as 

Ep, fc [xj-.fc] - y,-, (8) 

and that along the fc-th column as 
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(9) 



The expand operation along the row (column) returns a vector of length Nk time the number of 
processors in a column (row). 

Fox et al. [7] has prescribed an elegant and simple method to perform this operation (also see 
ref. [1]). In the first step, each processor allocates memory for the full vector y, and the vector 
Xfc is mapped to its position in y. Thus, at the beginning of the expand, fc-th processor has y = 
(0, 0, ....x fe , ...0). In the first communication step, each processor partners with the neighbouring 
processor, and exchange non-zero sub-pieces. At the end of this step, y = (0,0, ....x k ,x k+1 ...0) 
for the fc-th and (fc + l)-th processors. Every processor thus obtains a contiguous vector of length 
2Nk- In the next step, every processor partners with a processor that is two positions away, and 
exchanges the new piece (of length 2N}S). This leads to y = (0,0, ....x k ,x k+1 ,x k+2 ,x k+3 ....0) in each 
of fc, fc + 1, fc + 2, fc + 3 processors. This steps leads to each processor acquiring 4Nk contiguous vector 
length. This procedure is repeated till each processor communicates with a processor that is P/2 
position away, and at the end of which the entire vector y is acquired by all the P processors. Fig. 
Al shows the procedure for P = 4 and Nk = 3. 

For the expand operation given above, there are log 2 P messages and A^P — Nk length of vector 
exchanged. This is an optimal value. Allocation of N^P memory in each of the processing element 
can become prohibitive when Nk and P are large, though for most purpose this is never a real 
concern. Note that the number of processors involved in expand operation can be a sub-set of total 
number of available processors (see FD algorithms). 



APPENDIX B 



The fold operation between P processors is the inverse of expand. If each processor holds a vector 
y = {Ui\iLi where N = NkP, the fold operation between P processors leads to fc-th processor 
acquiring a vector x = {^i}iI=V The vector x is constructed by summing up vector y across all P 

processors and allocating the fc-th segment of it. Thus, if y J = (y{,y2,y3, yp) where y 3 k is the 

fc-th segment of y (of length Nk) in j-th processor, after a fold operation the fc-th processor receives 
x/j = J2j=i y°k- Symbolically, we represent this operation as 

Rp[y]=x fc (10) 

where y = {yi}f =1 and x = {x{\fl^ . Thus after a fold operation, the output buffer has a vector of 
length N / P where N is the vector length used for folding. 

The expand algorithm by Fox et al. can be inverted to obtain a simple and optimal method to 
perform fold operation (also see ref. [1]). In the first step, each processor pairs up with processor 
that is P/2 position away, and exchanges pieces of vector length N/2. The pieces that are received 
are the one that the processor must acquire as a final sum. Thus, processor pairs up with P/2 

and sends vector segment y = {yi} N _ K , and receives y = {yi}fJi, whereas the P/2-th processor 

l— 2 +1 

sends vector segment y = {y{\fli and receives y = {yi}^_N_ +1 - In the next step, the processor 

pairs up with the processor at P/4 and exchanges N/4 length of the new vector y, again noting that 
each processor in the pair receives that part which it requires in the summation. This procedure is 
done recursively, and in each step the vector length exchanged is halved. Fig. A2 shows the entire 
procedure for P = 4 and N = 12. This algorithm is again an optimal one requiring log 2 P steps and 
requiring exchange of N — N/P length of vector. 
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FIGURE CAPTIONS 



Fig.l The checkerboard matrix used for computing in A2 algorithm. 

Fig. 2 (a) The arrangement of processors and the vectors r and r for Fl and F2 algorithms. Note 
that the vector r is obtained by expanding first along column and then across the row, whereas 
the original vector r is generated by first expanding across a row and then along a column. 
The block-decomposition of the force matrix is shown in (b) . 

Fig. 3 The block-decomposition of the force matrix for F2 algorithm. Note the vector on the top r 
and compare it with that in Fig.l. The permuted vector leads to change in the checkerboard 
arrangement of the force matrix. 

Fig. 4 (a) The CPU time taken for a single MD step by different algorithms for a 4 processor system. 

(b) The CPU time taken for a single MD step by AD and FD algorithms with increasing 
number of processors in a simulation of 4800 particles . The dotted line is the ideal speed up 
and is the time for a single processor divided by the number of processors. 

Fig. Al The expand operation across 4 processors ranked to 3 for a vector with 3 elements. 
In the first step, each processor partners with nearest processor and exchanges sub-pieces. 
Thus gives elements {1,2,3}, and receives {4,5,6} elements of 1. After this, both and 1 
has contiguous array of elements {1,2,3,4,5,6}. Similarly 2 and 3 has {7,8,9,10,11,12}. In the 
second step, the processor pairs with those situated 2 positions away and repeats the exchange 
of non-zero pieces thus generating the full vector. There are only log 2 P steps. 

Fig. A2 The fold operation is an inverse of expand. This is shown here for P = 4 and Xfc with 
3 elements (total vector length N = 12). In the first step, each processor pairs with the 
processor that is P/2 position away and sends N/2 sub-piece in which it is not a member. 
Thus, processor gets the first half of 2 and sends it the second half. The pieces are added up 
and stored in the same vector (the stacking of the pieces sidewise in the figure implies addition). 
In the second step, the processor partners with those at position P/4 away, and receives the 
N/4 piece in which it is a member. Thus processor receives the {1,2,3} of processor 1, and 
sends it the elements {4,5,6}. Note that the elements {1,2,3} is the sum of elements held by 
processor 2 and 4. The fold operation also requires only log 2 P steps. 
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